EDA for Street Cleaning Function & Strict Dictionary

Follow Up Action Required: Check if this is appropriate

The Street Cleaning Function has 3 lines of codes that were set to be ignored. The code were:

# Lines of codes that had "##" in the Street Cleaning Function
##x<-gsub("\\bSTR\\b","", x)
##x<-gsub("\\<ST\\>","",x)
##x<-gsub("\\<STREET\\>","",x)

The combined Street Dictionary had several issues regarding “STR”, “ST” and “STREET”. This might have been the cause of those issues.

The Street Dictionary (combined and full) for Manhattan and Brooklyn had several abnormalities. From manual checking the issues that were identified included:


EDA of output from the Manhattan cleaning script

Manhattan

From the sample dataset, an output of 557,357 rows by 33 columns was derived.

There were 1,478 Enumeration Districts, 1,276 microfilms and 8,557 unique street names within the dataset.

The results from the Street Matching function were as follow:

  • 28% Perfect Match
  • 12% Identical Match
  • 47.5% Singular Mode
  • 3% Multiple Modes
  • 1.2% NAs
  • 0.5% No Match
# Distribution of Result Type in Manhattan
plot(table(mn_output$result_type), 
     type = "h", 
     col = c("blue", "red", "orange", "purple", "green", "pink"), 
     lwd = 10,
     main = "Result Type for Manhattan",
     ylab = "Count",
     xlab = "Result Type")

For Reference, these are examples of the different result types:

Perfect Match:
# Sample of Result Type 1 (Perfect Match)
mn_output %>% filter(result_type == 1) %>% select(ED, street_add, best_match, result_type) %>% head(1)
Identical Match:
# Sample of Result Type 2 (Identical Match)
mn_output %>% filter(result_type == 2) %>% select(ED, street_add, best_match, result_type) %>% head(1)
Singular Mode Match:
# Sample of Result Type 3 (Singular Mode)
mn_output %>% filter(result_type == 3) %>% select(ED, street_add, best_match, result_type) %>% head(1)
Multiple Modes Match:
# Sample of Result Type 4 (Multiple Modes)
mn_output %>% filter(result_type == 4) %>% select(ED, street_add, best_match, result_type) %>% head(1)
“NA” Match:
# Sample of Result Type 1 (NA)
mn_output %>% filter(result_type == 5) %>% select(ED, street_add, best_match, result_type) %>% head(3)
No Match:
# Sample of Result Type 6 (no match)
mn_output %>% filter(result_type == 6) %>% select(ED, street_add, best_match, result_type) %>% head(4)

Problematic EDs

Out of the problematic Street Matches (result type 5 and type 6), there is a trend of some EDs being more problematic than others, i.e. more entries of 5 or 6 within certain Enumeration Districts.

# Problematic EDs for Result Type 5
# Treshold set to 50 (arbitrarily decided)
mn_output %>% filter(result_type == 5) %>% 
  select(ED) %>% table() %>% sort(decreasing = TRUE) %>% head(29)
## .
## 0948 0836 1164 1393 0984 0365 0861 1267 0600 0288 0725 0081 0818 0049 1270 0009 
##  546  351  289  288  284  225  186  182  167  150  147  113  112  100  100   92 
## 0026 1392 0766 0421 0427 1679 1202 1643 0953 0394 0110 1377 1232 
##   88   87   85   82   78   77   74   74   72   65   54   54   50
# Problematic EDs for Result Type 6
# Treshold set to 50 (arbitrarily decided)
mn_output %>% filter(result_type == 6) %>% 
  select(ED) %>% table() %>% sort(decreasing = TRUE) %>% head(6)
## .
## 0815 1481 1913 0764 0782 1941 
## 1432  522  248   84   83   67
# Average result_type by ED
average_result_type_mn <- mn_output %>% 
  select(ED, result_type) %>%
  group_by(ED) %>%
  summarise(mean_result_type = mean(result_type))

average_result_type_mn$ED <- as.numeric(average_result_type_mn$ED)

mn_result_type_plot <- ggplot(average_result_type_mn, 
                              aes(x = ED, y = mean_result_type)) +
  theme_classic() +
  geom_point() + 
  labs(x = "ED", y = "Average Result Type", title = "Mean Result Type by ED (MN)"
  )

ggplotly(mn_result_type_plot)
# Standard Deviation of result_type by ED
sd_result_type_mn <- mn_output %>% 
  select(ED, result_type) %>%
  group_by(ED) %>%
  summarise(sd_result_type = sd(result_type))

sd_result_type_mn$ED <- as.numeric(sd_result_type_mn$ED)

mn_sd_result_type_plot <- ggplot(sd_result_type_mn, 
                              aes(x = ED, y = sd_result_type)) +
  theme_classic() +
  geom_point() + 
  labs(x = "ED", y = "Standard Deviation of Result Type", title = "Standard Deviation of Result Type by ED (MN)"
  )

ggplotly(mn_sd_result_type_plot)

Fill Down Function - Manhattan

Out of the 28% of Perfect Matches (156k entries) 1,717 entries were matched via the Fill Down function 03_Matched_Street_Fill_Down - 1.1%.

Flagged House Numbers - Manhattan

Out of 557,357 entries, 159,669 entries were flagged for house number changes (i.e. 0 or 1) - 28.6%.

7,350 entries were flagged with “1” meaning the house number was editted by the function. E.g. The initial household number might have been “195-7”, “34 TO 36” or “112 114”. There would be a split between the house numbers and hn_1 will be 195 and hn_2 will be 197 as in the first example.

# Problematic Enumeration Districts for House Number Cleaning
mn_output %>% filter(flag_hn_cleaned == 1) %>% 
  select(ED) %>% table() %>% sort(decreasing = TRUE) %>% head(24)
## .
## 0190 0984 0805 0843 0094 1267 0967 0212 0024 0586 1271 0403 1269 1073 0509 0256 
##  634  289  253  244  191  187  177  172  144  132  102  101   85   81   79   74 
## 1036 0681 1170 0786 0924 0442 0638 0768 
##   73   63   63   56   56   52   51   51

Out of all the house hold number entries, 364,730 entries were filled in via the Fill Down function 05_House_Number_Fill_Down - 65.4%.


Brooklyn

From the sample dataset, an output of 371,833 rows by 33 columns was derived.

There were 1,106 Enumeration Districts, 1,527 microfilms and 7,076 unique street names within the dataset.

The results from the Street Matching function are as follow:

  • 76.5% Perfect Match
  • 7.9% Identical Match
  • 9.5% Singular Mode
  • 1.5% Multiple Modes
  • 1.3% NAs
  • 0.8% No Match

In contrast to Manhattan, there is a larger proportion of Perfect Matches.

# Distribution of Result type in Brooklyn
plot(table(bk_output$result_type), 
     type = "h", 
     col = c("blue", "red", "orange", "purple", "green", "pink"), 
     lwd = 10,
     main = "Result Type for Brooklyn",
     ylab = "Count",
     xlab = "Result Type")

Problematic EDs - Brooklyn

Out of the problematic Street Matches (result type 5 and type 6), there is a trend of some EDs being more problematic than others, i.e. more entries of 5 or 6 within certain Enumeration Districts.

# Problematic EDs for Result Type 5
# Treshold set to 50 (arbitrarily decided)
bk_output %>% filter(result_type == 5) %>% 
  select(ED) %>% table() %>% sort(decreasing = TRUE) %>% head(27)
## .
## 1007 0519 0573 0147 0552 1066 0150 0120 1412 0331 1038 0477 0491 0562 0905 0926 
##  401  203  198  171  149  147  137  106  103  101   82   78   75   73   63   63 
## 1028 0559 0580 1026 0807 1408 0935 0154 0987 0365 1000 
##   62   61   61   61   59   57   55   54   54   52   50
# Problematic EDs for Result Type 6
# Treshold set to 50 (arbitrarily decided)
bk_output %>% filter(result_type == 6) %>% 
  select(ED) %>% table() %>% sort(decreasing = TRUE) %>% head(14)
## .
## 0915 0123 0608 0161 0058 1092 0253 0173 0909 0124 0547 0488 0733 0158 
##  252  131  110   99   88   88   83   74   69   63   62   60   56   55
# Average result_type by ED
average_result_type_bk <- bk_output %>% 
  select(ED, result_type) %>%
  group_by(ED) %>%
  summarise(mean_result_type = mean(result_type))

average_result_type_bk$ED <- as.numeric(average_result_type_bk$ED)

bk_result_type_plot <- ggplot(average_result_type_bk, 
                              aes(x = ED, y = mean_result_type)) +
  theme_classic() +
  geom_point() + 
  labs(x = "ED", y = "Average Result Type", title = "Mean Result Type by ED (BK)"
  )

ggplotly(bk_result_type_plot)
# Standard Deviation of result_type by ED
sd_result_type_bk <- bk_output %>% 
  select(ED, result_type) %>%
  group_by(ED) %>%
  summarise(sd_result_type = sd(result_type))

sd_result_type_bk$ED <- as.numeric(sd_result_type_bk$ED)

bk_sd_result_type_plot <- ggplot(sd_result_type_bk, 
                              aes(x = ED, y = sd_result_type)) +
  theme_classic() +
  geom_point() + 
  labs(x = "ED", y = "Standard Deviation of Result Type", title = "Standard Deviation of Result Type by ED (BK)"
  )

ggplotly(bk_sd_result_type_plot)

Fill Down Function - Brooklyn

Out of the 76.5% of Perfect Matches (284k entries) 1,005 entries were matched via the Fill Down function 03_Matched_Street_Fill_Down - A much smaller percentage than Manhattan. This suggests that the recording of entries might be more accurate in Brooklynn or that the street directory is more well-developed. Nevertheless, errors are still present.

Flagged House Numbers - Brooklyn

Out of 371,833 entries, 179,303 entries were flagged for house number changes (i.e. 0 or 1) - 48.2%.

2,567 entries were flagged with “1” meaning the house number was editted by the function. Brooklyn also has several abnormal initial house numbers. E.g. “222 1/2”, “192 3TH”. Otherwise, the issues are similar to Manhattan.

# Problematic Enumeration Districts for House Number Cleaning
bk_output %>% filter(flag_hn_cleaned == 1) %>% 
  select(ED) %>% table() %>% sort(decreasing = TRUE) %>% head(11)
## .
## 0614 0016 0013 0456 0321 0518 0029 0358 0003 0239 0419 
##  322  156  145  109  103  102   80   80   77   77   68

Out of all the house hold number entries, 182,935 entries were filled in via the Fill Down function 05_House_Number_Fill_Down - 49.1%.


Visualizing the Outputs spatially

# Aggregate the mean of result type, i.e. street matching success rate
# The lower the better
mn_output_result_type <- aggregate(result_type ~ ED, FUN = mean, data = mn_output)
mn_output_result_type$ED <- as.numeric(mn_output_result_type$ED)

# Change Manhattan Shapefile ED variable to numeric format
mn_map@data$ED <- as.numeric(mn_map@data$ED)

# Left join the 2 datasets
mn_map@data <- left_join(mn_map@data, mn_output_result_type, by = c('ED' = 'ED'))
# Map for Result Type in Manhattan
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(mn_map) + tm_polygons("result_type")
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
# Aggregate the mean of result type, i.e. street matching success rate
# The lower the better
bk_output_result_type <- aggregate(result_type ~ ED, FUN = mean, data = bk_output)
bk_output_result_type$ED <- as.numeric(bk_output_result_type$ED)

# Change Brooklyn Shapefile ED variable to numeric format
bk_map@data$ED <- as.numeric(bk_map@data$ED)

# Left join the 2 datasets
bk_map@data <- left_join(bk_map@data, bk_output_result_type, by = c('ED' = 'ED'))
# Map for Result Type in Brooklyn
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(bk_map) + tm_polygons("result_type")
# Highlight 2 examples in Brooklyn with poor street name matches
bk_output %>% filter(ED == 1034 | ED == 1039)

Other Problems to Look into

  • Missing EDs in the entire Dataset
    • There are missing EDs in the entire dataset
    • The EDs could be names wrongly
      • In the Shapefile, the max ED is 1480 but the max ED in the output runs till 1941
# Missing EDs in Manhattan

# Manhattan ED list
mn_ed_list <- mn_map@data$ED %>% sort()

mn_output_ed <- mn_output$ED %>% unique() %>% as.numeric() %>% sort()
# Which EDs are in the Shapefile but missing from the Manhattan dataset
mn_ed_list[!(mn_output_ed %in% mn_ed_list)]
##  [1] 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404
## [16] 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419
## [31] 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434
## [46] 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449
## [61] 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464
## [76] 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478
length(mn_ed_list[!(mn_output_ed %in% mn_ed_list)])
## [1] 89
# Which EDs are in the dataset but missing from the Manhattan shapefile
mn_output_ed[!(mn_ed_list %in% mn_output_ed)]
##  [1]  180  508  693  705 1022 1077 1181 1233 1284 1285 1653 1654 1655 1656 1657
## [16] 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672
## [31] 1673 1674 1675 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1688 1689
## [46] 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704
## [61] 1705 1706 1707 1708 1709 1710 1711 1712 1716 1717 1722 1723 1724 1725 1727
## [76] 1730 1731 1732 1733 1734 1736 1742 1743 1745 1747 1748 1756 1913 1941   NA
## [91]   NA
length(mn_output_ed[!(mn_ed_list %in% mn_output_ed)])
## [1] 91
# Missing EDs in Brooklyn

# Brooklyn ED list
bk_ed_list <- bk_map@data$ED %>% sort()

# Brooklyn Output EDs
bk_output_ed <- bk_output$ED %>% unique() %>% as.numeric() %>% sort()
# Which EDs are in the Shapefile but missing from the Brooklyn dataset
bk_ed_list[!(bk_output_ed %in% bk_ed_list)]
##  [1] 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106
length(bk_ed_list[!(bk_output_ed %in% bk_ed_list)])
## [1] 11
# Which EDs are in the dataset but missing from the Manhattan shapefile
bk_output_ed[!(bk_ed_list %in% bk_output_ed)]
##  [1]  763  951  961  964  965 1041 1087 1410 1411 1412 1413   NA   NA   NA   NA
## [16]   NA   NA
length(bk_output_ed[!(bk_ed_list %in% bk_output_ed)])
## [1] 17

Problematic Best Matches - Manhattan

There are 38,174 entries with NA best_match. Their street_add entries are also NA. There are 792 EDs with these problems.

na_mn_output <- mn_output %>% filter(is.na(best_match)) %>% select(ED) %>% unique()

na_mn_output <- na_mn_output %>% mutate('NA best match' = 1)
na_mn_map <- merge(mn_map, na_mn_output, 'ED', 'ED')
# Map for NA Street Address / Best Match in Manhattan
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(na_mn_map) + tm_polygons("NA best match")
# Another problematic best match is "0"
mn_output %>% filter(best_match == 0) %>% select(street_add) %>% unique()

Problematic Best Matches - Brooklyn

There are 9,264 entries with NA best_match. Their street_add entries are also NA. There are 427 EDs with these problems.

na_bk_output <- bk_output %>% filter(is.na(best_match)) %>% select(ED) %>% unique()

na_bk_output <- na_bk_output %>% mutate('NA best match' = 1)
na_bk_map <- merge(bk_map, na_bk_output, 'ED', 'ED')
# Map for NA Street Address / Best Match in Brooklyn
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(na_bk_map) + tm_polygons("NA best match")